{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Example usage of the `Document` class for data manipulation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import module\n",
    "\n",
    "from dygie.data.dataset_readers import document"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ACE event data\n",
    "\n",
    "Load in a dataset and print a brief description."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset with 77 documents.\n"
     ]
    }
   ],
   "source": [
    "dataset = document.Dataset.from_jsonl(\"../data/ace-event/normalized-data/default-settings/json/dev.json\")\n",
    "print(dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Grab a document, and print out the sentences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0: CNN_CF_20030303.1900.02\n",
      "1: STORY\n",
      "2: 2003 - 03 - 03T19:00:00 - 05:00\n",
      "3: New Questions About Attacking Iraq ; Is Torturing Terrorists Necessary ?\n",
      "4: NOVAK Welcome back .\n",
      "5: Orders went out today to deploy 17,000 U.S. Army soldiers in the Persian Gulf region .\n",
      "6: The army 's entire first Calvary division based at Fort Hood , Texas , would join the quarter million U.S. forces already in the region .\n",
      "7: We 're talking about possibilities of full scale war with former Congressman Tom Andrews , Democrat of Maine .\n",
      "8: He 's now national director of Win Without War , and former Congressman Bob Dornan , Republican of California .\n",
      "9: BEGALA Bob , one of the reasons I think so many Americans are worried about this war and so many people around the world do n't want to go is there have been a lot of problems with credibility from this administration .\n",
      "10: Our president has repeatedly , for example , relied on a man whom you 're aware , Hussein Kamel , Saddam Hussein 's son - in - law , leader of the Iraq arms program who defected for a time .\n",
      "11: And gave us a whole lot of information and then went home and his father - in - law killed him .\n",
      "12: Bad move .\n",
      "13: But while he was here , he gave us a whole lot of information .\n",
      "14: Gave us a whole lot of information .\n",
      "15: Well , our president told us that information proves that the dictator had chemical weapons , which is true .\n",
      "16: But what we just learned this week from \" Newsweek \" magazine which got a hold of the debriefings , is that he also told us it was destroyed back in 1995 .\n",
      "17: Why has n't our president told us that ?\n"
     ]
    }
   ],
   "source": [
    "doc = dataset[0]\n",
    "print(doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Grab a single sentence from the document, and print. The characters will be shown, with character indices underneath.fd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Orders went out today to deploy 17,000 U.S. Army soldiers in the Persian Gulf region .\n",
      "0      1    2   3     4  5      6      7    8    9        10 11  12      13   14     15\n"
     ]
    }
   ],
   "source": [
    "sent = doc[5]\n",
    "print(sent)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Examine a named entity in the dataset. Printing an entity mention shows:\n",
    "- The token indices in the current sentence.\n",
    "- The mention text.\n",
    "- The entity type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(7, 7, ['U.S.']): GPE\n"
     ]
    }
   ],
   "source": [
    "ner = sent.ner[0]\n",
    "print(ner)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Entities are as spans. Spans \"know\" what sentence they're in, and also know their indices with respect to the sentence and the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "span = ner.span"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Printing the span shows the start and end indices, and the text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(7, 7, ['U.S.'])\n"
     ]
    }
   ],
   "source": [
    "print(span)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Spans have a pointer back to the sentence they're from."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Orders went out today to deploy 17,000 U.S. Army soldiers in the Persian Gulf region .\n",
      "0      1    2   3     4  5      6      7    8    9        10 11  12      13   14     15\n"
     ]
    }
   ],
   "source": [
    "print(span.sentence)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Spans also know their indices relative to the sentence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "print(span.start_sent)\n",
    "print(span.end_sent)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And to the document they're part of"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "31\n",
      "31\n"
     ]
    }
   ],
   "source": [
    "print(span.start_doc)\n",
    "print(span.end_doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Relations are represented as two spans and a relation label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(8, 8, ['Army']), (7, 7, ['U.S.']): PART-WHOLE.Subsidiary\n"
     ]
    }
   ],
   "source": [
    "rel = sent.relations[0]\n",
    "print(rel)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Events are represented as a trigger followed by a last of argument spans."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<(5, 'deploy', Movement.Transport):\n",
      "      (9, 9, ['soldiers'], Movement.Transport, Artifact);\n",
      "      (14, 14, ['region'], Movement.Transport, Destination)>\n"
     ]
    }
   ],
   "source": [
    "ev = sent.events[0]\n",
    "print(ev)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can convert a document back to a json-style dict that matches the DyGIE data format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dict_keys(['doc_key', 'dataset', 'sentences', 'ner', 'relations', 'events'])\n"
     ]
    }
   ],
   "source": [
    "js = doc.to_json()\n",
    "print(js.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also split a long document up into shorter segments. This is useful for dealing with documents that are too large to fit into GPU.\n",
    "\n",
    "**CAVEAT**: This functionality isn't implemented yet for coref annotations. These are challenging because they cross sentence boundaries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0: The army 's entire first Calvary division based at Fort Hood , Texas , would join the quarter million U.S. forces already in the region .\n",
      "1: We 're talking about possibilities of full scale war with former Congressman Tom Andrews , Democrat of Maine .\n"
     ]
    }
   ],
   "source": [
    "small_docs = doc.split(max_tokens_per_doc=50)\n",
    "print(small_docs[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The document-level span indices update when the document is split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n"
     ]
    }
   ],
   "source": [
    "print(small_docs[1][0].ner[0].span.start_doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SciERC data\n",
    "\n",
    "Unlike ACE event, the SciERC data have coreference annotations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0: An entity-oriented approach to restricted-domain parsing is proposed .\n",
      "1: In this approach , the definitions of the structure and surface representation of domain entities are grouped together .\n",
      "2: Like semantic grammar , this allows easy exploitation of limited domain semantics .\n",
      "3: In addition , it facilitates fragmentary recognition and the use of multiple parsing strategies , and so is particularly useful for robust recognition of extra-grammatical input .\n",
      "4: Several advantages from the point of view of language definition are also noted .\n",
      "5: Representative samples from an entity-oriented language definition are presented , along with a control structure for an entity-oriented parser , some parsing strategies that use the control structure , and worked examples of parses .\n",
      "6: A parser incorporating the control structure and the parsing strategies is currently under implementation .\n"
     ]
    }
   ],
   "source": [
    "dataset = document.Dataset.from_jsonl(\"../data/scierc/normalized_data/json/dev.json\")\n",
    "doc = dataset[2]\n",
    "print(doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the coreference clusters. The coreference clusters are written like:\n",
    "\n",
    "`[cluster-id]: [[<sent_index> (span_start, span_end), span_text], ...]`\n",
    "\n",
    "So, cluster 0 has two mentions: \"this\" in sentence 2, and \"it\" in sentence 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0: [<2> (4, 4, ['this']), <3> (3, 3, ['it'])]\n",
      "\n",
      "1: [<0> (1, 2, ['entity-oriented', 'approach']), <1> (2, 2, ['approach'])]\n",
      "\n",
      "2: [<5> (21, 22, ['parsing', 'strategies']), <6> (8, 9, ['parsing', 'strategies'])]\n",
      "\n",
      "3: [<5> (13, 14, ['control', 'structure']), <5> (26, 27, ['control', 'structure']), <6> (4, 5, ['control', 'structure'])]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for clust in doc.clusters:\n",
    "    print(clust)\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predictions\n",
    "\n",
    "You can also load in predicted data. The code will populate attributes for `predicted_ner`, `predicted_relations`, etc. That match the attributes we've shown already."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
